
Add interface to launch parallel dygraph by multiprocessing #26044

Merged

Conversation

@chenwhql chenwhql commented Aug 7, 2020

PR types

New features

PR changes

APIs

Describe

This PR adds the multiprocessing start methods start_processes and spawn for dygraph data parallel training.

1. Start method difference

  • start by launch

python -m paddle.distributed.launch --selected_gpus=0,1 train.py

  • start by spawn

python train.py

and call spawn inside the __main__ block, for example (a fuller self-contained sketch follows the snippet):

paddle.distributed.spawn(train_mnist,
    args=(args,),
    nprocs=args.nprocs,
    join=True)
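
Putting the two together, a minimal sketch of a spawn-mode entry script (train_mnist and the --nprocs flag are placeholders for illustration, not code from this PR):

import argparse

import paddle.distributed as dist


def train_mnist(args):
    # Placeholder for the per-process training loop; it would call
    # dist.init_parallel_env() before building the model.
    pass


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--nprocs', type=int, default=2)
    args = parser.parse_args()
    # spawn starts nprocs worker processes running train_mnist(args)
    # and, with join=True, blocks until all of them exit.
    dist.spawn(train_mnist, args=(args,), nprocs=args.nprocs, join=True)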

2. Simple example

from __future__ import print_function

import paddle
import paddle.nn as nn
import paddle.optimizer as opt
import paddle.distributed as dist

class LinearNet(nn.Layer):
    def __init__(self):
        super(LinearNet, self).__init__()
        self._linear1 = nn.Linear(10, 10)
        self._linear2 = nn.Linear(10, 1)
        
    def forward(self, x):
        return self._linear2(self._linear1(x))

def train(print_result=False):
    # 1. enable dynamic mode
    paddle.disable_static()
    
    # 2. initialize parallel environment
    dist.init_parallel_env()

    # 3. create data parallel layer & optimizer
    layer = LinearNet()
    dp_layer = paddle.DataParallel(layer)

    loss_fn = nn.MSELoss()
    adam = opt.Adam(
        learning_rate=0.001, parameters=dp_layer.parameters())

    # 4. run layer
    inputs = paddle.randn([10, 10], 'float32')
    outputs = dp_layer(inputs)
    labels = paddle.randn([10, 1], 'float32')
    loss = loss_fn(outputs, labels)
    
    if print_result is True:
        print("loss:", loss.numpy())
    
    loss = dp_layer.scale_loss(loss)
    loss.backward()
    dp_layer.apply_collective_grads()

    adam.step()
    adam.clear_grad()

# Usage 1: only pass the function.
# If your training method does not need any arguments and
# uses all visible devices for parallel training.
if __name__ == '__main__':
    dist.spawn(train)

# Usage 2: pass the function and arguments.
# If your training method needs some arguments and
# uses all visible devices for parallel training.
if __name__ == '__main__':
    dist.spawn(train, args=(True,))

# Usage 3: pass the function, arguments and nprocs.
# If your training method needs some arguments and
# only uses part of the visible devices for parallel training.
# If your machine holds 8 cards {0,1,2,3,4,5,6,7},
# this case will use cards {0,1}; if you set
# CUDA_VISIBLE_DEVICES=4,5,6,7, this case will use
# cards {4,5}.
if __name__ == '__main__':
    dist.spawn(train, args=(True,), nprocs=2)

# Usage 4: pass the function, arguments, nprocs and selected_gpus.
# If your training method needs some arguments, you only want to
# use part of the visible devices for parallel training, and you
# can't set your machine's environment variable
# CUDA_VISIBLE_DEVICES (for example, it is unset or lists all cards
# {0,1,2,3,4,5,6,7}), you can pass `selected_gpus` to
# select the GPU cards you want to use. For example,
# this case will use cards {4,5} if your machine holds 8 cards.
if __name__ == '__main__':
    dist.spawn(train, args=(True,), nprocs=2, selected_gpus='4,5')

3. API change

  • Add 4 new APIs:

    • paddle.distributed.spawn: start multi-process training by the spawn method
    • paddle.distributed.init_parallel_env: initialize parallel environment variables & get the parallel strategy
    • paddle.distributed.get_rank: get the current process rank
    • paddle.distributed.get_world_size: get the current world size (see the sketch after this list)
  • Move 2 old APIs:

    • paddle.prepare_context (fluid.dygraph.prepare_context) -> paddle.distributed.prepare_context
    • paddle.ParallelEnv (fluid.dygraph.ParallelEnv) -> paddle.distributed.ParallelEnv
  • Refine 1 old API:

    • paddle.DataParallel (fluid.dygraph.DataParallel): strategy becomes an optional argument
  • Deprecate 1 old API:

    • paddle.distributed.prepare_context (fluid.dygraph.prepare_context): to be replaced by paddle.distributed.init_parallel_env later
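
A hedged sketch (not code from this PR) of how the two query APIs could be used inside a training function, assuming init_parallel_env has already been called in each worker:

import paddle.distributed as dist


def log_process_info():
    # get_rank() returns the rank of the current process and
    # get_world_size() returns the total number of parallel processes;
    # both assume dist.init_parallel_env() has already been called.
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    print("worker {} of {}".format(rank, world_size))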

4. Correctness

Verified the correctness of the interface with the following models:

  • Mnist: test_parallel_dygraph_mnist.py
  • SeResNext: test_parallel_dygraph_se_resnext.py
  • Transformer: test_parallel_dygraph_transformer.py

5. Related docs

(Five screenshots of the related API documentation were attached here.)

@gongweibao gongweibao (Contributor) left a comment

It would be best to have a performance comparison for the spawn mode?

@guru4elephant guru4elephant (Member) left a comment

LGTM

ParallelStrategy = core.ParallelStrategy


def init_parallel_env(backend='nccl'):

A reviewer (Member) commented on this line:

NCCL is an underlying communication library; I don't think it's necessary to let users know we have different backends here. If we want to support operating systems such as Windows that don't support NCCL, it's better to detect the operating system inside the init function and use another communication library, such as gloo. I highly recommend removing the backend argument for now, for simplicity of usage.

@chenwhql (Contributor, Author) replied:

Thanks, I think it is okay to remove it; we can discuss removing this argument via a cherry-pick.
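
As an aside, a minimal sketch of the backend auto-detection the reviewer describes (not part of this PR; the helper name _select_backend is hypothetical):

import sys


def _select_backend():
    # NCCL is only available on Linux, so fall back to gloo elsewhere;
    # this way users never need to pass a backend argument themselves.
    if sys.platform.startswith('linux'):
        return 'nccl'
    return 'gloo'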

@guru4elephant guru4elephant self-requested a review August 28, 2020 05:11
@chenwhql chenwhql closed this Aug 28, 2020
@chenwhql chenwhql reopened this Aug 28, 2020

@guru4elephant guru4elephant (Member) left a comment

please remove the backend argument for simplicity

@chenwhql (Contributor, Author) replied, quoting @gongweibao:

It would be best to have a performance comparison for the spawn mode?

Thanks for the feedback; there indeed should be one. May I follow up with a report later? The development time for this interface was a bit short: for most of the past week we were discussing and iterating on the interface design, and it has to ship with the 2.0-beta release, so only correctness has been verified and the performance comparison has not been done yet.

In theory this interface is no different from launch; it only changes how the multiple processes are started, without adding any extra implementation, so there should be no performance difference. It is also just an optional startup method and does not affect the original usage of launch.

@XiaoguangHu01 XiaoguangHu01 (Contributor) left a comment

LGTM

@jzhang533 jzhang533 (Contributor) left a comment

lgtm

@chenwhql chenwhql requested a review from kolinwei August 28, 2020 06:12

@raindrops2sea raindrops2sea (Collaborator) left a comment

LGTM
